Precision agriculture is increasingly popular. It helps farmers make informed decisions about farming strategies. The chosen dataset will support a prediction model that indicates the best crops to cultivate on a given farm, depending on numerous criteria.
From this data, I am curious to learn:
Which crops flourish in high temperatures and which in low temperatures?
Which types of crops do the best in which types of soil?
How much rainfall is required for various crops?
Dataset Name : crop recommendation
Dataset link : https://www.kaggle.com/datasets/aksahaha/crop-recommendation
Author Name : ABHISHEK KUMAR
Data is collected from ICAR (Indian Council of Agricultural Research), supplemented by some online searches.
COLLECTION METHODOLOGY : This information is gathered by speaking with farmers and other agricultural professionals about their experiences cultivating crops under various environmental conditions.
The cases are observational. This dataset includes the amounts of nitrogen, phosphorus, and potassium in the soil, along with temperature, humidity, pH, and rainfall, and how these factors affect crop development.
Nitrogen - Nitrogen content in the soil
Phosphorus - Phosphorus content in the soil
Potassium - Potassium content in the soil
Temperature - Temperature in degrees Celsius
Humidity - Relative humidity in %
ph - pH value of the soil
rainfall - Rainfall in mm
Label - Crop type
It is an observational study.
#Loading tidyverse library and importing data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(readxl)
proposal_data<- read.csv("~/Desktop/CHECK/Crop_recommendation.csv")
Data looks as expected.
dim(proposal_data)
## [1] 2200 10
Original data contains 2200 observations and 10 variables.
head(proposal_data)
## Nitrogen phosphorus potassium temperature humidity ph rainfall label X
## 1 90 42 43 20.87974 82.00274 6.502985 202.9355 rice NA
## 2 85 58 41 21.77046 80.31964 7.038096 226.6555 rice NA
## 3 60 55 44 23.00446 82.32076 7.840207 263.9642 rice NA
## 4 74 35 40 26.49110 80.15836 6.980401 242.8640 rice NA
## 5 78 42 42 20.13017 81.60487 7.628473 262.7173 rice NA
## 6 69 37 42 23.05805 83.37012 7.073454 251.0550 rice NA
## X.1
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
Checking the first 6 rows of the data to get an idea of which variables are present.
is.data.frame(proposal_data)
## [1] TRUE
proposal_data2 <- as_tibble(proposal_data)
is_tibble(proposal_data2)
## [1] TRUE
As the dataset above is a data frame, it is converted to a tibble.
str(proposal_data2)
## tibble [2,200 × 10] (S3: tbl_df/tbl/data.frame)
## $ Nitrogen : int [1:2200] 90 85 60 74 78 69 69 94 89 68 ...
## $ phosphorus : int [1:2200] 42 58 55 35 42 37 55 53 54 58 ...
## $ potassium : int [1:2200] 43 41 44 40 42 42 38 40 38 38 ...
## $ temperature: num [1:2200] 20.9 21.8 23 26.5 20.1 ...
## $ humidity : num [1:2200] 82 80.3 82.3 80.2 81.6 ...
## $ ph : num [1:2200] 6.5 7.04 7.84 6.98 7.63 ...
## $ rainfall : num [1:2200] 203 227 264 243 263 ...
## $ label : chr [1:2200] "rice" "rice" "rice" "rice" ...
## $ X : logi [1:2200] NA NA NA NA NA NA ...
## $ X.1 : logi [1:2200] NA NA NA NA NA NA ...
Checking detailed information about data.
sum(is.na(proposal_data2))
## [1] 4400
There are 4,400 missing values in the data.
colSums(is.na(proposal_data2))
## Nitrogen phosphorus potassium temperature humidity ph
## 0 0 0 0 0 0
## rainfall label X X.1
## 0 0 2200 2200
From the above we can see there are two columns (X and X.1) containing no information.
clean_data <- proposal_data2[c(1:2200),-c(9,10)]
sum(is.na(clean_data))
## [1] 0
dim(clean_data)
## [1] 2200 8
In the above step, the extra columns with no data were removed, keeping only the columns that contain data. After removing them, the data has only 8 variables.
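A name-based alternative to the positional indexing above: dplyr's `select()` with `where()` drops every all-NA column regardless of its position. A minimal sketch on a hypothetical toy tibble with the same X/X.1 pattern (not the actual dataset):

```r
library(dplyr)

# toy stand-in with two all-NA columns, like X and X.1 above
toy <- tibble::tibble(a = 1:3, b = c("x", "y", "z"), X = NA, X.1 = NA)

# keep only columns that are not entirely NA
toy_clean <- toy %>% select(where(~ !all(is.na(.x))))
names(toy_clean)  # "a" "b"
```

The same one-liner applied to proposal_data2 should drop X and X.1 without hard-coding column positions.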
sum(duplicated(clean_data))
## [1] 0
No duplicate rows are present in the data.
summary(clean_data)
## Nitrogen phosphorus potassium temperature
## Min. : 0.00 Min. : 5.00 Min. : 5.00 Min. : 8.826
## 1st Qu.: 21.00 1st Qu.: 28.00 1st Qu.: 20.00 1st Qu.:22.769
## Median : 37.00 Median : 51.00 Median : 32.00 Median :25.599
## Mean : 50.55 Mean : 53.36 Mean : 48.15 Mean :25.616
## 3rd Qu.: 84.25 3rd Qu.: 68.00 3rd Qu.: 49.00 3rd Qu.:28.562
## Max. :140.00 Max. :145.00 Max. :205.00 Max. :43.675
## humidity ph rainfall label
## Min. :14.26 Min. :3.505 Min. : 20.21 Length:2200
## 1st Qu.:60.26 1st Qu.:5.972 1st Qu.: 64.55 Class :character
## Median :80.47 Median :6.425 Median : 94.87 Mode :character
## Mean :71.48 Mean :6.469 Mean :103.46
## 3rd Qu.:89.95 3rd Qu.:6.924 3rd Qu.:124.27
## Max. :99.98 Max. :9.935 Max. :298.56
Summarizing the data to check the mean, median, and quartiles of each variable.
clean_data %>% glimpse()
## Rows: 2,200
## Columns: 8
## $ Nitrogen <int> 90, 85, 60, 74, 78, 69, 69, 94, 89, 68, 91, 90, 78, 93, 94…
## $ phosphorus <int> 42, 58, 55, 35, 42, 37, 55, 53, 54, 58, 53, 46, 58, 56, 50…
## $ potassium <int> 43, 41, 44, 40, 42, 42, 38, 40, 38, 38, 40, 42, 44, 36, 37…
## $ temperature <dbl> 20.87974, 21.77046, 23.00446, 26.49110, 20.13017, 23.05805…
## $ humidity <dbl> 82.00274, 80.31964, 82.32076, 80.15836, 81.60487, 83.37012…
## $ ph <dbl> 6.502985, 7.038096, 7.840207, 6.980401, 7.628473, 7.073454…
## $ rainfall <dbl> 202.9355, 226.6555, 263.9642, 242.8640, 262.7173, 251.0550…
## $ label <chr> "rice", "rice", "rice", "rice", "rice", "rice", "rice", "r…
The above data has 7 numerical variables and one categorical variable.
clean_data %>% group_by(label) %>% summarise(count=n())
## # A tibble: 22 × 2
## label count
## <chr> <int>
## 1 apple 100
## 2 banana 100
## 3 blackgram 100
## 4 chickpea 100
## 5 coconut 100
## 6 coffee 100
## 7 cotton 100
## 8 grapes 100
## 9 jute 100
## 10 kidneybeans 100
## # ℹ 12 more rows
print(paste('Standard Deviation of Rainfall: ',sd(clean_data$rainfall)))
## [1] "Standard Deviation of Rainfall: 54.9583885248781"
print(paste('Standard Deviation of Temperature: ',sd(clean_data$temperature)))
## [1] "Standard Deviation of Temperature: 5.06374859995884"
print(paste('Standard Deviation of Humidity: ',sd(clean_data$humidity)))
## [1] "Standard Deviation of Humidity: 22.2638115897611"
print(paste('Standard Deviation of ph: ',sd(clean_data$ph)))
## [1] "Standard Deviation of ph: 0.773937688029873"
print(paste('Standard Deviation of Nitrogen: ',sd(clean_data$Nitrogen)))
## [1] "Standard Deviation of Nitrogen: 36.9173338337566"
print(paste('Standard Deviation of Phosphorous: ',sd(clean_data$phosphorus)))
## [1] "Standard Deviation of Phosphorous: 32.9858827385872"
print(paste('Standard Deviation of Potassium: ',sd(clean_data$potassium)))
## [1] "Standard Deviation of Potassium: 50.6479305466601"
print(paste('Variance of Rainfall: ',var(clean_data$rainfall)))
## [1] "Variance of Rainfall: 3020.42446925146"
print(paste('Variance of Temperature: ',var(clean_data$temperature)))
## [1] "Variance of Temperature: 25.6415498835851"
print(paste('Variance of Humidity: ',var(clean_data$humidity)))
## [1] "Variance of Humidity: 495.67730650438"
print(paste('Variance of ph: ',var(clean_data$ph)))
## [1] "Variance of ph: 0.598979544953026"
print(paste('Variance of Nitrogen: ',var(clean_data$Nitrogen)))
## [1] "Variance of Nitrogen: 1362.88953739303"
print(paste('Variance of Phosphorous: ',var(clean_data$phosphorus)))
## [1] "Variance of Phosphorous: 1088.06846004382"
print(paste('Variance of Potassium: ',var(clean_data$potassium)))
## [1] "Variance of Potassium: 2565.21286865931"
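The repeated `print(paste(...))` calls above can be condensed with `dplyr::across()`, which applies `sd()` and `var()` to every column at once. A minimal sketch on a toy tibble (`toy` is an illustrative stand-in for the numeric columns of clean_data):

```r
library(dplyr)

# toy stand-in for two of clean_data's numeric columns
toy <- tibble::tibble(rainfall = c(203, 227, 264),
                      temperature = c(20.9, 21.8, 23.0))

# one summarise() call produces <col>_sd and <col>_var for every column
stats_tbl <- toy %>%
  summarise(across(everything(), list(sd = sd, var = var)))
stats_tbl
```

On clean_data, `summarise(across(where(is.numeric), list(sd = sd, var = var)))` would replace all fourteen print statements.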
ggplot(data = clean_data) +
geom_bar(mapping = aes(x = label,colour="label"))+coord_flip()
Result - From the above visualization, we can see the count of samples collected for each crop.
clean_data %>% group_by(label) %>% summarise(avg = mean(rainfall)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Rainfall level", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can observe which crops need a high level of rainfall and which do not. For example, rice and coconut need high rainfall, whereas muskmelon needs very little rainfall to grow. This answers the rainfall question from the proposal.
clean_data %>% group_by(label) %>% summarise(avg = mean(Nitrogen)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Nitrogen content in land", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can observe which crops need a high amount of nitrogen in the soil and which do not. For example, cotton and coffee need high nitrogen, whereas lentil needs very little. So crops such as cotton, coffee, muskmelon, banana, watermelon, and rice will grow well in soil with a high amount of nitrogen.
clean_data %>% group_by(label) %>% summarise(avg = mean(potassium)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Potassium content in land", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can observe that most crops do not need much potassium in the soil to grow. However, grapes and apple need a very high amount of potassium.
clean_data %>% group_by(label) %>% summarise(avg = mean(phosphorus)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "phosphorous content in land", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can see that most crops need a medium amount of phosphorus in the soil, whereas apple and grapes need a high amount, and orange, coconut, watermelon, muskmelon, and pomegranate need very little.
clean_data %>% group_by(label) %>% summarise(avg = mean(temperature)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Temperature", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can see that a few crops such as papaya, mango, blackgram, muskmelon, and mungbean grow in high temperatures, while no crop has a mean temperature under 18 degrees. So the crops in this data grow only in medium or high temperatures. This answers the temperature question from the proposal.
clean_data %>% group_by(label) %>% summarise(avg = mean(humidity)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "HUMIDITY", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can see that most crops need a high amount of humidity (moisture in the air), whereas chickpea, kidney beans, pigeon peas, and mango need very little.
clean_data %>% group_by(label) %>% summarise(avg = mean(ph)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "ph of crop field", x = "CROPS", y = "AVERAGE") +coord_flip()
Result - From the above visualization, we can see that most crops in the data need a similar soil pH.
Step - Calculating the mean of all samples per crop to see exactly what climatic and land conditions should be present for each crop to grow.
SETA <- clean_data %>% group_by(label) %>%
  summarise(mean_N = mean(Nitrogen),
            mean_ph = mean(phosphorus),  # note: mean of phosphorus, not soil pH
            mean_T = mean(temperature), mean_K = mean(potassium),
            mean_R = mean(rainfall), mean_H = mean(humidity),
            .groups = "drop")
print(SETA)
## # A tibble: 22 × 7
## label mean_N mean_ph mean_T mean_K mean_R mean_H
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 apple 20.8 134. 22.6 200. 113. 92.3
## 2 banana 100. 82.0 27.4 50.0 105. 80.4
## 3 blackgram 40.0 67.5 30.0 19.2 67.9 65.1
## 4 chickpea 40.1 67.8 18.9 79.9 80.1 16.9
## 5 coconut 22.0 16.9 27.4 30.6 176. 94.8
## 6 coffee 101. 28.7 25.5 29.9 158. 58.9
## 7 cotton 118. 46.2 24.0 19.6 80.4 79.8
## 8 grapes 23.2 133. 23.8 200. 69.6 81.9
## 9 jute 78.4 46.9 25.0 40.0 175. 79.6
## 10 kidneybeans 20.8 67.5 20.1 20.0 106. 21.6
## # ℹ 12 more rows
Result - The requirement from the above step is satisfied: we can clearly see what level of temperature, rainfall, and humidity, and what amounts of nitrogen, potassium, and phosphorus, each crop needs to grow. (Note that the mean_ph column holds the phosphorus mean, not soil pH: apple's value of about 134 lies far outside the observed pH range of 3.5 to 9.9.)
Step - Converting the wide dataset into long format to get a proper visualization of the data.
pivoted_data <- SETA %>% pivot_longer(mean_N:mean_H,names_to = "land_condition",values_to = "Amount")
print(pivoted_data)
## # A tibble: 132 × 3
## label land_condition Amount
## <chr> <chr> <dbl>
## 1 apple mean_N 20.8
## 2 apple mean_ph 134.
## 3 apple mean_T 22.6
## 4 apple mean_K 200.
## 5 apple mean_R 113.
## 6 apple mean_H 92.3
## 7 banana mean_N 100.
## 8 banana mean_ph 82.0
## 9 banana mean_T 27.4
## 10 banana mean_K 50.0
## # ℹ 122 more rows
Result - The requirement from the above step is satisfied: after conversion, the different soil and climatic conditions are listed in the land_condition column, and the required amount or level in the Amount column.
sum(is.na(pivoted_data))
## [1] 0
Result - After conversion there are still no NA values present.
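A quick sanity check that the wide-to-long conversion is lossless: pivoting the long form back with `pivot_wider()` should reproduce the original tibble. Sketched on a toy tibble with the same column pattern as SETA (not the real data):

```r
library(tidyr)
library(dplyr)

# toy stand-in with SETA's label + mean_* column pattern
toy <- tibble::tibble(label  = c("apple", "banana"),
                      mean_N = c(20.8, 100.0),
                      mean_H = c(92.3, 80.4))

# wide -> long, as done for pivoted_data above
long <- toy %>% pivot_longer(mean_N:mean_H,
                             names_to = "land_condition",
                             values_to = "Amount")

# long -> wide should round-trip back to the original
wide <- long %>% pivot_wider(names_from = land_condition,
                             values_from = Amount)
isTRUE(all.equal(as.data.frame(toy), as.data.frame(wide)))  # TRUE
```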
pivoted_data %>% group_by(label) %>% ggplot() + geom_bar(mapping = aes(x = label, y = Amount,color =land_condition), stat = "identity")+coord_flip()
Result - In the above visual representation, we can clearly see the land conditions for each crop; all the conditions a crop needs to grow healthily are shown, with each condition distinguished by colour. Each bar segment represents one condition.
ggplot(data = pivoted_data) +geom_point(mapping = aes(x=Amount,y=label,colour =land_condition),alpha = 0.7,show.legend = TRUE)
Result - The same representation using points, with conditions distinguished by colour, gives a clearer picture of the amount of each condition required.
ggplot(data = SETA) +geom_point(mapping = aes(x=label,y=mean_H,colour =mean_N ))+coord_flip()
Result - From the above visual representation, there is no clear relation between nitrogen and humidity; in some cases where the nitrogen level in the land is high, humidity is also high.
ggplot(data = SETA) +geom_point(mapping = aes(x=mean_R,y=label,colour =mean_T))
Result - The above graph is plotted to understand the relation between rainfall and temperature. Whether the temperature is high or low, each crop needs a certain amount of rainfall, so there is no apparent relation between temperature and rainfall.
ggplot(data=pivoted_data)+geom_line(mapping=aes(x=land_condition,y=Amount,color=label),group =1)
Result - In the above graph, the conditions for different crops are shown using lines, but the representation is not clear because there are so many crops.
Step - As the above representation is not clear, the data is divided into subsets.
# Dividing the data into subsets for clearer visualization
subsetF<- filter(SETA, label %in% c("apple","banana","blackgram","chickpea","coconut","coffee","cotton","grapes","jute","kidneybeans","lentil"))
subsetF
## # A tibble: 11 × 7
## label mean_N mean_ph mean_T mean_K mean_R mean_H
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 apple 20.8 134. 22.6 200. 113. 92.3
## 2 banana 100. 82.0 27.4 50.0 105. 80.4
## 3 blackgram 40.0 67.5 30.0 19.2 67.9 65.1
## 4 chickpea 40.1 67.8 18.9 79.9 80.1 16.9
## 5 coconut 22.0 16.9 27.4 30.6 176. 94.8
## 6 coffee 101. 28.7 25.5 29.9 158. 58.9
## 7 cotton 118. 46.2 24.0 19.6 80.4 79.8
## 8 grapes 23.2 133. 23.8 200. 69.6 81.9
## 9 jute 78.4 46.9 25.0 40.0 175. 79.6
## 10 kidneybeans 20.8 67.5 20.1 20.0 106. 21.6
## 11 lentil 18.8 68.4 24.5 19.4 45.7 64.8
# Dividing the data into subsets for clearer visualization
subsetG<- filter(SETA, label %in% c("maize","mango","mothbeans","mungbean","muskmelon","orange","papaya","pigeonpeas","pomegranate","rice","watermelon"))
subsetG
## # A tibble: 11 × 7
## label mean_N mean_ph mean_T mean_K mean_R mean_H
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 maize 77.8 48.4 22.4 19.8 84.8 65.1
## 2 mango 20.1 27.2 31.2 29.9 94.7 50.2
## 3 mothbeans 21.4 48.0 28.2 20.2 51.2 53.2
## 4 mungbean 21.0 47.3 28.5 19.9 48.4 85.5
## 5 muskmelon 100. 17.7 28.7 50.1 24.7 92.3
## 6 orange 19.6 16.6 22.8 10.0 110. 92.2
## 7 papaya 49.9 59.0 33.7 50.0 143. 92.4
## 8 pigeonpeas 20.7 67.7 27.7 20.3 149. 48.1
## 9 pomegranate 18.9 18.8 21.8 40.2 108. 90.1
## 10 rice 79.9 47.6 23.7 39.9 236. 82.3
## 11 watermelon 99.4 17 25.6 50.2 50.8 85.2
Result - The data is divided into two subsets, F and G, each containing 11 crops.
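The hard-coded label vectors above can be avoided by splitting the sorted crop names in half programmatically. A minimal sketch on a hypothetical toy tibble (`toy` stands in for SETA):

```r
library(dplyr)

# toy stand-in: 6 "crops", to be split into two halves of 3
toy <- tibble::tibble(label = c("a", "b", "c", "d", "e", "f"),
                      value = 1:6)

crops <- sort(unique(toy$label))
first_half <- crops[seq_len(ceiling(length(crops) / 2))]

subset1 <- toy %>% filter(label %in% first_half)   # first half of crops
subset2 <- toy %>% filter(!label %in% first_half)  # remaining crops
c(nrow(subset1), nrow(subset2))  # 3 3
```

Applied to SETA, this reproduces subsetF and subsetG (11 crops each) without typing out 22 crop names.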
pivoted_dataA <- subsetF %>% pivot_longer(mean_N:mean_H,names_to = "land_condition",values_to = "Amount")
print(pivoted_dataA)
## # A tibble: 66 × 3
## label land_condition Amount
## <chr> <chr> <dbl>
## 1 apple mean_N 20.8
## 2 apple mean_ph 134.
## 3 apple mean_T 22.6
## 4 apple mean_K 200.
## 5 apple mean_R 113.
## 6 apple mean_H 92.3
## 7 banana mean_N 100.
## 8 banana mean_ph 82.0
## 9 banana mean_T 27.4
## 10 banana mean_K 50.0
## # ℹ 56 more rows
pivoted_dataB <- subsetG %>% pivot_longer(mean_N:mean_H,names_to = "land_condition",values_to = "Amount")
print(pivoted_dataB)
## # A tibble: 66 × 3
## label land_condition Amount
## <chr> <chr> <dbl>
## 1 maize mean_N 77.8
## 2 maize mean_ph 48.4
## 3 maize mean_T 22.4
## 4 maize mean_K 19.8
## 5 maize mean_R 84.8
## 6 maize mean_H 65.1
## 7 mango mean_N 20.1
## 8 mango mean_ph 27.2
## 9 mango mean_T 31.2
## 10 mango mean_K 29.9
## # ℹ 56 more rows
Result - Converted the wide data subsets into long format.
ggplot(data=pivoted_dataA)+geom_line(mapping=aes(x=Amount, y=land_condition, group=label, color=land_condition))+ facet_wrap(~ label)
Result - From the above visual representation, we can see that no two crops match in every condition. For example, lentil, grapes, and apple have similar temperatures but differ in the remaining conditions. The corresponding question from the proposal is answered.
ggplot(data=pivoted_dataB)+geom_line(mapping=aes(x=Amount, y=land_condition, group=label, color=land_condition))+ facet_wrap(~ label)
Result - From the above visual representation, we can clearly see that mothbeans and mungbean require almost the same climatic and land conditions, whereas the remaining crops have quite different conditions.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(subsetF)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Result - From the above visual representation, we can observe both positive and negative correlations between variables. For example, temperature and nitrogen have a positive correlation, whereas ph and nitrogen have a negative correlation.
ggplot(data = pivoted_dataA, mapping = aes(x = Amount, y = land_condition,group=label,colour=label)) + geom_boxplot()
Result - This visualization checks which crops have similar conditions. All the conditions of jute and blackgram are in a similar range but vary slightly, as do those of grapes and apple. So for almost all crops, the conditions differ in one respect or another.
ggplot(data = pivoted_dataB, mapping = aes(x = Amount, y = land_condition,group=label,colour=label)) + geom_boxplot()
Step - Taking a sample of 2 crops and calculating means and standard deviations to perform a two-tailed test on rainfall for apple and banana.
Null hypothesis: The mean rainfall for apple equals the mean rainfall for banana.
Alternative hypothesis: The mean rainfall for apple does not equal the mean rainfall for banana.
Apple <- clean_data$rainfall[clean_data$label == "apple"]
Banana <- clean_data$rainfall[clean_data$label == "banana"]
print(Mean1 <- mean(Apple))
## [1] 112.6548
print(Mean2 <- mean(Banana))
## [1] 104.627
sd1 <- sd(Apple)
print(paste('Standard Deviation of Apple rainfall: ', sd1))
## [1] "Standard Deviation of Apple rainfall: 7.10298539071806"
round(sd1,digits = 2)
## [1] 7.1
sd2 <- sd(Banana)
print(paste('Standard Deviation of Banana rainfall: ', sd2))
## [1] "Standard Deviation of Banana rainfall: 9.39814957319825"
round(sd2,digits = 2)
## [1] 9.4
n<-2200
SE <- sqrt((sd1^2/n) + (sd2^2/n))
SE
## [1] 0.2511588
As it is a two-tailed test, the alpha value is split in two (0.05/2 = 0.025), and the critical value is taken from the t-table accordingly.
X1 <- Mean1
X2 <- Mean2
t <- (X1-X2)/SE
alpha = 0.05
zscore <- 1.96
#As it is two tailed test, if Z is less than -1.96 or if Z is greater than 1.96 we reject Null Hypothesis.
t<-round(t,digits = 2)
t
## [1] 31.96
print(degreesoffreedom <- (n+n)-2)
## [1] 4398
Result - From the t-statistic we can conclude that t is greater than 1.96, and the p-value is less than alpha (from a t-distribution table, p < 0.001), so we reject the null hypothesis.
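As a cross-check, R's built-in Welch two-sample t-test can be applied directly to the two rainfall vectors. Note that each crop has 100 observations (see the group counts above), not 2200, so the manual standard error above uses too large an n and overstates t; the conclusion (reject the null) still holds with the correct sizes. A sketch on simulated stand-in samples with the same means and standard deviations (`apple_r` and `banana_r` are hypothetical stand-ins for `Apple` and `Banana`):

```r
set.seed(1)                                        # reproducible stand-in data
apple_r  <- rnorm(100, mean = 112.65, sd = 7.10)   # stand-in for Apple rainfall
banana_r <- rnorm(100, mean = 104.63, sd = 9.40)   # stand-in for Banana rainfall

# Welch two-sample t-test with correct per-group sizes (n = 100 each)
welch <- t.test(apple_r, banana_r)
welch$p.value < 0.05  # the difference remains significant
```

`t.test(Apple, Banana)` on the real vectors gives the exact test statistic, degrees of freedom, and p-value in one call.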
Step : People from a particular place claimed that pH ranges from 6.47 to 6.48, while the mean pH in our data is 6.469.
Null hypothesis: The mean pH is equal to 6.48.
Alternative hypothesis: The mean pH is less than 6.48.
alpha = 0.05
zscore = 1.64
t.test(clean_data$ph, mu = 6.48, alternative = "less")
##
## One Sample t-test
##
## data: clean_data$ph
## t = -0.63756, df = 2199, p-value = 0.2619
## alternative hypothesis: true mean is less than 6.48
## 95 percent confidence interval:
## -Inf 6.496632
## sample estimates:
## mean of x
## 6.46948
Result - Here the p-value (0.2619) is greater than the alpha value (0.05), so we fail to reject the null hypothesis.
plot(x = SETA$mean_T,y = SETA$mean_R,
xlab = "Temperature",
ylab = "Rainfall",
main = "Temperature vs Rainfall"
)
Result - No clear relation is seen between the variables.
ggplot(SETA, aes(x = mean_T, y = mean_R)) + geom_point(color = "red") + geom_smooth(method = "lm", color = "blue")
## `geom_smooth()` using formula = 'y ~ x'
B <- lm(SETA$mean_R~SETA$mean_T)
summary(B)
##
## Call:
## lm(formula = SETA$mean_R ~ SETA$mean_T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.66 -34.45 -3.12 35.38 131.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.2623 82.2271 1.475 0.156
## SETA$mean_T -0.6948 3.1793 -0.219 0.829
##
## Residual standard error: 53.18 on 20 degrees of freedom
## Multiple R-squared: 0.002382, Adjusted R-squared: -0.0475
## F-statistic: 0.04776 on 1 and 20 DF, p-value: 0.8292
Result - From the regression we get the equation ŷ = 121.2623 − 0.6948x, where β0 = 121.2623 and β1 = −0.6948.
Summary_Data <- summary(B) # capture model summary as an object
CoefficientsB <- Summary_Data$coefficients # model coefficients
beta.estimate <- CoefficientsB["SETA$mean_T", "Estimate"] # get beta estimate for speed
standard_error <- CoefficientsB["SETA$mean_T", "Std. Error"] # get std.error for speed
t_value <- beta.estimate/standard_error # calc t statistic
t_value
## [1] -0.2185447
qt(p = .025, df = 20)
## [1] -2.085963
Result - We reject the null hypothesis at the 5% significance level when |t| >= |t_0.025,20|. Here |−0.2185447| < |−2.085963|, so we fail to reject the null hypothesis and conclude that the fitted regression slope is not significant (consistent with the p-value of 0.829).
qt(p = .005, df = 20)
## [1] -2.84534
Result - We reject the null hypothesis at the 1% significance level when |t| >= |t_0.005,20|. Here |−0.2185447| < |−2.84534|, so again we fail to reject the null hypothesis; the regression slope is not significant at the 1% level either.
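The same conclusion can be read off the p-value: computed directly from the slope's t statistic and 20 degrees of freedom, it matches the Pr(>|t|) column in the summary(B) output above.

```r
t_value <- -0.2185447                      # slope t statistic from summary(B)
p_two_sided <- 2 * pt(-abs(t_value), df = 20)  # two-sided p-value
round(p_two_sided, 3)                      # 0.829, matching Pr(>|t|) above
```

Since 0.829 is far above both 0.05 and 0.01, the slope is not significant at either level.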
cor(SETA$mean_T,SETA$mean_R)
## [1] -0.04880984
confint(B, level=.95)
## 2.5 % 97.5 %
## (Intercept) -50.260415 292.78496
## SETA$mean_T -7.326705 5.93707
cbind(B$residuals, B$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = B$residuals, x = B$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)
SETA_sqrt <- sqrt(SETA$mean_R)
D1 <- lm(SETA_sqrt~SETA$mean_T)
summary(D1)
##
## Call:
## lm(formula = SETA_sqrt ~ SETA$mean_T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7315 -1.5590 0.1505 2.0042 5.4013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.23613 4.01645 2.798 0.0111 *
## SETA$mean_T -0.05358 0.15530 -0.345 0.7337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.598 on 20 degrees of freedom
## Multiple R-squared: 0.005916, Adjusted R-squared: -0.04379
## F-statistic: 0.119 on 1 and 20 DF, p-value: 0.7337
Result - From the above we get the equation ŷ = 11.23613 − 0.05358x.
cbind(D1$residuals, D1$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = D1$residuals, x = D1$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)
There is not much difference in the spread of the fitted values and residuals, but we can observe an increase in the residuals.
new_data <- clean_data[1:7]
scaled_data = scale(new_data)
head(scaled_data)
## Nitrogen phosphorus potassium temperature humidity ph
## [1,] 1.0685545 -0.34447243 -0.10166439 -0.9353743 0.4725590 0.04329189
## [2,] 0.9331167 0.14058356 -0.14115268 -0.7594734 0.3969610 0.73470553
## [3,] 0.2559281 0.04963556 -0.08192025 -0.5157809 0.4868431 1.77110780
## [4,] 0.6351537 -0.55668443 -0.16089682 0.1727678 0.3897169 0.66015759
## [5,] 0.7435039 -0.34447243 -0.12140853 -1.0834008 0.4546883 1.49752731
## [6,] 0.4997160 -0.49605243 -0.12140853 -0.5051979 0.5339759 0.78039027
## rainfall
## [1,] 1.809949
## [2,] 2.241548
## [3,] 2.920402
## [4,] 2.536471
## [5,] 2.897714
## [6,] 2.685511
Data_A <-as.matrix(new_data)
CovarianceMatrix = cov(Data_A)
print("CovarianceMatrix")
## [1] "CovarianceMatrix"
CovarianceMatrix
## Nitrogen phosphorus potassium temperature humidity
## Nitrogen 1362.889537 -281.860096 -262.72715 4.95462225 156.730700
## phosphorus -281.860096 1088.068460 1229.99865 -21.30347754 -87.197323
## potassium -262.727147 1229.998647 2565.21287 -41.13422930 215.215502
## temperature 4.954622 -21.303478 -41.13423 25.64154988 23.147400
## humidity 156.730700 -87.197323 215.21550 23.14740049 495.677307
## ph 2.762395 -3.523487 -6.64424 -0.06973913 -0.146161
## rainfall 119.747146 -115.730685 -148.81121 -8.37217973 115.534462
## ph rainfall
## Nitrogen 2.76239482 119.747146
## phosphorus -3.52348679 -115.730685
## potassium -6.64424046 -148.811212
## temperature -0.06973913 -8.372180
## humidity -0.14616095 115.534462
## ph 0.59897954 -4.639202
## rainfall -4.63920157 3020.424469
Result - The covariance matrix is calculated to observe how one variable varies with another, i.e. the relation between pairs of variables. Both positive and negative covariances are observed in the data.
Step - The correlation is calculated below to check whether there is any trend between variables.
CorrelationMatrix = cor(Data_A)
print("CorrelationMatrix")
## [1] "CorrelationMatrix"
CorrelationMatrix
## Nitrogen phosphorus potassium temperature humidity
## Nitrogen 1.00000000 -0.23145958 -0.14051184 0.02650380 0.190688379
## phosphorus -0.23145958 1.00000000 0.73623222 -0.12754113 -0.118734116
## potassium -0.14051184 0.73623222 1.00000000 -0.16038713 0.190858861
## temperature 0.02650380 -0.12754113 -0.16038713 1.00000000 0.205319677
## humidity 0.19068838 -0.11873412 0.19085886 0.20531968 1.000000000
## ph 0.09668285 -0.13801889 -0.16950310 -0.01779502 -0.008482539
## rainfall 0.05902022 -0.06383905 -0.05346135 -0.03008378 0.094423053
## ph rainfall
## Nitrogen 0.096682846 0.05902022
## phosphorus -0.138018893 -0.06383905
## potassium -0.169503098 -0.05346135
## temperature -0.017795017 -0.03008378
## humidity -0.008482539 0.09442305
## ph 1.000000000 -0.10906948
## rainfall -0.109069484 1.00000000
transversecov <- t(CovarianceMatrix)
transversecov
## Nitrogen phosphorus potassium temperature humidity
## Nitrogen 1362.889537 -281.860096 -262.72715 4.95462225 156.730700
## phosphorus -281.860096 1088.068460 1229.99865 -21.30347754 -87.197323
## potassium -262.727147 1229.998647 2565.21287 -41.13422930 215.215502
## temperature 4.954622 -21.303478 -41.13423 25.64154988 23.147400
## humidity 156.730700 -87.197323 215.21550 23.14740049 495.677307
## ph 2.762395 -3.523487 -6.64424 -0.06973913 -0.146161
## rainfall 119.747146 -115.730685 -148.81121 -8.37217973 115.534462
## ph rainfall
## Nitrogen 2.76239482 119.747146
## phosphorus -3.52348679 -115.730685
## potassium -6.64424046 -148.811212
## temperature -0.06973913 -8.372180
## humidity -0.14616095 115.534462
## ph 0.59897954 -4.639202
## rainfall -4.63920157 3020.424469
Orthogonal_check = CovarianceMatrix%*%transversecov
Orthogonal_check
## Nitrogen phosphorus potassium temperature humidity
## Nitrogen 2044874.629 -1041621.49 -1363017.68 26316.4970 273278.179
## phosphorus -1041621.492 2797697.98 4566938.91 -76766.6804 68576.767
## potassium -1363017.675 4566938.91 8232437.95 -127849.7451 492177.013
## temperature 26316.497 -76766.68 -127849.75 3433.8007 4881.335
## humidity 273278.179 68576.77 492177.01 4881.3351 338065.626
## ph 5926.462 -12235.79 -21445.73 395.6818 -1299.891
## rainfall 614659.449 -702147.83 -979774.86 -13647.5637 402870.793
## ph rainfall
## Nitrogen 5926.46170 614659.45
## phosphorus -12235.79248 -702147.83
## potassium -21445.73322 -979774.86
## temperature 395.68185 -13647.56
## humidity -1299.89101 402870.79
## ph 86.09891 -12304.14
## rainfall -12304.13759 9186281.55
EigenValuescovariance = eigen(CovarianceMatrix)
EigenValuescovariance
## eigen() decomposition
## $values
## [1] 3434.457714 2933.051775 1349.006967 566.260449 252.160060 23.008944
## [7] 0.567262
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.180735334 -0.028231188 0.946431143 0.2656764987 -0.013517602
## [2,] -0.440880097 0.207408888 -0.057190143 0.5601808540 0.667049900
## [3,] -0.758764968 0.393266840 0.214289728 -0.2272182955 -0.413334754
## [4,] 0.010980213 -0.009163157 0.002162218 -0.0349757509 0.075558593
## [5,] -0.015315872 0.067715774 0.224545325 -0.7493652958 0.614766109
## [6,] 0.001466901 -0.002582226 0.001242598 0.0003892233 0.001401021
## [7,] 0.443709263 0.892664376 -0.068194347 0.0348346648 -0.019174321
## [,6] [,7]
## [1,] 0.006060655 -0.0015397910
## [2,] -0.024029262 -0.0001324106
## [3,] 0.034854411 0.0028681784
## [4,] 0.996377177 0.0095348819
## [5,] -0.072607445 -0.0013559969
## [6,] -0.009704231 0.9999466735
## [7,] 0.006127128 0.0018117827
Result - Calculating the eigendecomposition of the covariance matrix; the decomposition is split into eigenvalues and eigenvectors.
Valuescov <- eigen(CovarianceMatrix)$values
is.nan(Valuescov)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Valuescov
## [1] 3434.457714 2933.051775 1349.006967 566.260449 252.160060 23.008944
## [7] 0.567262
Vectorscov <-eigen(CovarianceMatrix)$vectors
Vectorscov
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.180735334 -0.028231188 0.946431143 0.2656764987 -0.013517602
## [2,] -0.440880097 0.207408888 -0.057190143 0.5601808540 0.667049900
## [3,] -0.758764968 0.393266840 0.214289728 -0.2272182955 -0.413334754
## [4,] 0.010980213 -0.009163157 0.002162218 -0.0349757509 0.075558593
## [5,] -0.015315872 0.067715774 0.224545325 -0.7493652958 0.614766109
## [6,] 0.001466901 -0.002582226 0.001242598 0.0003892233 0.001401021
## [7,] 0.443709263 0.892664376 -0.068194347 0.0348346648 -0.019174321
## [,6] [,7]
## [1,] 0.006060655 -0.0015397910
## [2,] -0.024029262 -0.0001324106
## [3,] 0.034854411 0.0028681784
## [4,] 0.996377177 0.0095348819
## [5,] -0.072607445 -0.0013559969
## [6,] -0.009704231 0.9999466735
## [7,] 0.006127128 0.0018117827
EigenValuescorrelation = eigen(CorrelationMatrix)
EigenValuescorrelation
## eigen() decomposition
## $values
## [1] 1.9312182 1.2939102 1.0765093 1.0228912 0.8059284 0.6765616 0.1929812
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.30219096 -0.33410693 -0.1120450 0.54165059 0.50778466 0.48290443
## [2,] -0.64378667 -0.03435809 -0.1099391 0.04629318 -0.08233115 0.37684700
## [3,] -0.62260719 -0.28382920 -0.1631733 0.15486709 -0.03342452 0.02896707
## [4,] 0.21242839 -0.35948683 -0.2482280 -0.69082649 -0.15486542 0.50041798
## [5,] 0.06848339 -0.73791663 -0.2135991 0.06717140 -0.12887133 -0.54787098
## [6,] 0.22694272 0.22065738 -0.5485203 0.39570047 -0.65188053 0.12571195
## [7,] 0.07253163 -0.29015800 0.7352670 0.20531846 -0.51838188 0.23992979
## [,7]
## [1,] 0.008472888
## [2,] 0.649104376
## [3,] -0.692268474
## [4,] -0.111281619
## [5,] 0.289624027
## [6,] -0.040027859
## [7,] -0.038576857
Result - The same spectral decomposition applied to the correlation matrix: $values holds the seven eigenvalues (which sum to 7, the number of variables) and $vectors holds the corresponding eigenvectors as columns.
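A quick sanity check on the eigenvalues (sketched here on toy data, not the crop data): the eigenvalues of any correlation matrix sum to its trace, i.e. the number of variables, since every diagonal entry is 1.

```r
set.seed(3)
# Toy 4-variable dataset and its correlation matrix
R <- cor(matrix(rnorm(400), ncol = 4))
ev <- eigen(R)$values
# The eigenvalue sum equals the trace of R, which is 4
all.equal(sum(ev), 4)  # TRUE
```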
Valuescor <- eigen(CorrelationMatrix)$values
is.nan(Valuescor)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Valuescor
## [1] 1.9312182 1.2939102 1.0765093 1.0228912 0.8059284 0.6765616 0.1929812
Vectorscor <- eigen(CorrelationMatrix)$vectors
Vectorscor
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.30219096 -0.33410693 -0.1120450 0.54165059 0.50778466 0.48290443
## [2,] -0.64378667 -0.03435809 -0.1099391 0.04629318 -0.08233115 0.37684700
## [3,] -0.62260719 -0.28382920 -0.1631733 0.15486709 -0.03342452 0.02896707
## [4,] 0.21242839 -0.35948683 -0.2482280 -0.69082649 -0.15486542 0.50041798
## [5,] 0.06848339 -0.73791663 -0.2135991 0.06717140 -0.12887133 -0.54787098
## [6,] 0.22694272 0.22065738 -0.5485203 0.39570047 -0.65188053 0.12571195
## [7,] 0.07253163 -0.29015800 0.7352670 0.20531846 -0.51838188 0.23992979
## [,7]
## [1,] 0.008472888
## [2,] 0.649104376
## [3,] -0.692268474
## [4,] -0.111281619
## [5,] 0.289624027
## [6,] -0.040027859
## [7,] -0.038576857
Squareroot <- Vectorscov %*% diag(sqrt(Valuescov)) %*% t(Vectorscov)
Squareroot
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 36.53938440 -3.57722281 -2.5357817 -0.00291165 2.668100755 0.063397804
## [2,] -3.57722281 28.37690443 16.1607979 -0.17189899 -2.784237689 -0.048469178
## [3,] -2.53578166 16.16079785 47.7497771 -0.80661723 3.895165740 -0.121208300
## [4,] -0.00291165 -0.17189899 -0.8066172 4.89369222 0.988652472 -0.035518176
## [5,] 2.66810076 -2.78423769 3.8951657 0.98865247 21.503469040 0.008556504
## [6,] 0.06339780 -0.04846918 -0.1212083 -0.03551818 0.008556504 0.754118104
## [7,] 1.18887505 -1.03342478 -1.3162204 -0.18558547 1.502518400 -0.088829509
## [,7]
## [1,] 1.18887505
## [2,] -1.03342478
## [3,] -1.31622036
## [4,] -0.18558547
## [5,] 1.50251840
## [6,] -0.08882951
## [7,] 54.89909606
Result - Computed the matrix square root of the covariance matrix by spectral decomposition, i.e. V %*% diag(sqrt(lambda)) %*% t(V); the resulting matrix contains both positive and negative off-diagonal entries.
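The construction can be verified on any positive-definite matrix: multiplying the spectral square root by itself should recover the original. A minimal sketch on a toy covariance matrix (M is a stand-in, not the crop-data CovarianceMatrix):

```r
# Toy positive-definite covariance matrix
M <- cov(matrix(c(1, 2, 3, 4, 2, 4, 7, 9), ncol = 2))
e <- eigen(M)
# Spectral square root: same recipe as Squareroot above
Root <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
all.equal(Root %*% Root, M)  # TRUE: Root squared recovers M
```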
# Percent variance explained: each correlation eigenvalue divided by their sum
Percentage_Variance <- EigenValuescorrelation$values / sum(EigenValuescorrelation$values)
round(Percentage_Variance, 4)
## [1] 0.2759 0.1848 0.1538 0.1461 0.1151 0.0967 0.0276
round(cumsum(Percentage_Variance), 4)
## [1] 0.2759 0.4607 0.6145 0.7606 0.8758 0.9724 1.0000
plot(Percentage_Variance)
plot(cumsum((Percentage_Variance)))
From the cumulative plot, the curve rises steeply over the first few components and then flattens. The first four correlation eigenvalues are at or above 1 and together account for about 76% of the total variance, so the key components can be taken as 1:4.
So columns 1-4 of the correlation eigenvector matrix are taken as the principal component directions.
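The eigenvalue-based proportions of variance can be cross-checked against R's built-in prcomp(). A self-contained sketch on toy standardized data (not the crop data, so the block runs on its own):

```r
set.seed(1)
# Toy standardized data: 100 observations of 3 variables
X <- scale(matrix(rnorm(300), ncol = 3))
ev <- eigen(cor(X))$values                     # correlation eigenvalues
prop <- ev / sum(ev)                           # proportion of variance explained
pca <- prcomp(X, center = FALSE)               # PCA on the already-scaled data
# prcomp's squared standard deviations are the same eigenvalues
all.equal(prop, pca$sdev^2 / sum(pca$sdev^2))  # TRUE
```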
eigenvectors2 = EigenValuescorrelation$vectors[,1:4]
eigenvectors2
## [,1] [,2] [,3] [,4]
## [1,] 0.30219096 -0.33410693 -0.1120450 0.54165059
## [2,] -0.64378667 -0.03435809 -0.1099391 0.04629318
## [3,] -0.62260719 -0.28382920 -0.1631733 0.15486709
## [4,] 0.21242839 -0.35948683 -0.2482280 -0.69082649
## [5,] 0.06848339 -0.73791663 -0.2135991 0.06717140
## [6,] 0.22694272 0.22065738 -0.5485203 0.39570047
## [7,] 0.07253163 -0.29015800 0.7352670 0.20531846
colnames(eigenvectors2) = c("e1", "e2", "e3","e4")
PC1 <- as.matrix(scaled_data) %*% eigenvectors2[,1]
#PC1
PC2 <- as.matrix(scaled_data) %*% eigenvectors2[,2]
#PC2
PC3 <- as.matrix(scaled_data) %*% eigenvectors2[,3]
#PC3
PC4 <- as.matrix(scaled_data) %*% eigenvectors2[,4]
#PC4
PC <- data.frame(PC1, PC2, PC3,PC4)
head(PC)
## PC1 PC2 PC3 PC4
## 1 0.5827370 -0.8443937 1.373031 1.6137623
## 2 0.4745270 -0.7847161 1.251893 1.7923547
## 3 0.6339243 -0.6943646 1.179064 1.8176923
## 4 1.0476815 -1.0874105 1.393035 0.9821774
## 5 0.8730591 -0.6585232 1.455354 2.3344808
## 6 0.8470899 -0.9348971 1.576211 1.4739631
Result - The table shows the scores of the first six observations on the four retained principal components; for these rows the scores on PC1, PC3, and PC4 are all positive.
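The projection used above (scaled data multiplied by the retained eigenvectors) is the standard way to obtain PC scores. A sketch on toy data shows it matches prcomp() up to the arbitrary sign of each eigenvector:

```r
set.seed(2)
# Toy standardized data: 100 observations of 2 variables
X <- scale(matrix(rnorm(200), ncol = 2))
V <- eigen(cor(X))$vectors  # eigenvectors of the correlation matrix
scores <- X %*% V           # manual PC scores, as computed above
pca <- prcomp(X, center = FALSE)
# Identical up to column signs (eigenvectors are determined only up to sign)
all.equal(abs(scores), abs(pca$x), check.attributes = FALSE)  # TRUE
```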
Learnings - This dataset helped me learn about different crops and the climatic and soil conditions we check for a particular plant to grow; certain plants grow only under a narrow range of conditions. I also learned that when presenting data visually we should be clear about what we need to communicate, and the matrix operations clarified the difference between covariance and correlation.
Limitations - This dataset is useful for farmers and agriculture students, but the data is limited and the relationships between the variables are not clear; the conditions look random rather than dependent.
Future Work - After analyzing the data, I concluded that it has more scope: we could indicate the region or country in which these crops grow, and more data is required to examine the relationships between the variables.
Conclusion - This project helps in finding the amount of rainfall and temperature needed for a plant to grow, along with soil conditions such as nitrogen content and pH. Although only a few varieties of plants are covered, with further research this dataset has more scope.
The reference below helped me understand why this data is collected and how comparisons are made across different categories.
https://www.kaggle.com/code/atharvaingle/what-crop-to-grow
pH value reference for one-sample test -
https://nutrientstewardship.org/implementation/soil-ph-and-the-availability-of-plant-nutrients/#:~:text=It%20has%20been%20determined%20that,compatible%20to%20plant%20root%20growth.
t-distribution table -